Abstract — Content distribution over networks is often achieved 
by using mirror sites that hold copies of files or portions thereof to 
avoid congestion and delay issues arising from excessive demands 
to a single location. Accordingly, there are distributed storage 
solutions that divide the file into pieces and place copies of the 
pieces (replication) or coded versions of the pieces (coding) at 
multiple source nodes. 

We consider a network that uses network coding for multicasting
the file. There is a set of source nodes that contain either
subsets or coded versions of the pieces of the file. The cost of a
given storage solution is defined as the sum of the storage cost 
and the cost of the flows required to support the multicast. Our 
interest is in finding the storage capacities and flows at minimum 
combined cost. We formulate the corresponding optimization 
problems by using the theory of information measures. In 
particular, we show that when there are two source nodes, there 
is no loss in considering subset sources. For three source nodes, 
we derive a tight upper bound on the cost gap between the coded 
and uncoded cases. We also present algorithms for determining 
the content of the source nodes. 

Index Terms — Content distribution, information measures, 
minimum cost, network coding. 

I. Introduction 

Large scale content distribution over the Internet is a topic
of great interest and has been the subject of numerous studies
[1][2][3][4]. The dominant mode of content distribution is the
client-server model, where a given client requests the file from a
central server, which then proceeds to service the request.
A single server location, however, is likely to be overwhelmed
when a large number of users request a file at the same
time, because of bottleneck constraints at a storage location
or other network limitations in reaching that server location.
Thus, content, such as websites or videos for download, is
often replicated by the use of mirrors [1]. Such issues are
of particular interest to Content Delivery Networks (CDNs) 
[5][6][7], which have their own, often multi-tiered, mirroring
topology. In other cases, content is hosted by third parties, 
who manage complex mirroring networks and direct requests 
to different locations according to the current estimate of the 
Internet's congestion, sometimes termed the weathermap; e.g.,
reference [4] describes techniques for load balancing in a
network to avoid hot spots. One may consider the usage of
coding for replicating the content, e.g., through erasure codes 
such as Reed-Solomon codes or fountain codes. 
Peer-to-peer networks have also been proposed for content 
distribution in a distributed manner [8][3][9]. However, the
underlying content distribution mechanism in a peer-to-peer 
network is different when compared to CDNs, since they do 
not use mirror sites. Instead, a given node downloads data from 
available peers in a highly opportunistic fashion. The technique 
of network coding has also been used for content distribution
in networks [10]. Under network coding based multicast, the
problem of allocating resources such as rates and flows in the
network can be solved in polynomial time [11]. Coding not
only allows guaranteed optimal performance which is at least
as good as tree-based approaches [12], but also does not suffer
from the complexity issues associated with Steiner tree packings.
Moreover, one can arrive at distributed solutions to these
problems [11][13]. Recently, these optimization approaches
have been generalized to minimize download time [14][15].
In these approaches, the peers, acting as source nodes, are
given. The goal of the optimization is to reduce the download
time by controlling the amount of information transmitted at
different peers. As for multicast transmission optimization, the
use of coding renders the problem highly tractable, obviating
the difficult combinatorial issues associated with optimization
in uncoded peer-to-peer networks [16].

In this work, we consider the following problem. Suppose 
that there is a large file, that may be subdivided into small 
pieces, that needs to be transmitted to a given set of clients 
over a network using network coding. The network has a 
designated set of nodes (called source nodes) that have storage 
space. Each unit of storage space and each unit of flow over a 
certain edge has a known linear cost. We want to determine the 
optimal storage capacities and flow patterns over the network 
such that this can be done with minimum cost. Underlying this 
optimization is the fact that source coding and network coding 
are not separable [17]. Hence, there is a benefit in jointly
considering network coding for distribution and the correlation
among the sources (see [18] for a survey). Lee et al. [19] and
Ramamoorthy et al. [20] showed how to optimize multicast
cost when the sources are correlated. While that problem is
closely related to ours, since it considers correlated sources
and optimization of delivery using such correlated sources, it
assumes a given correlation, and no cost is associated with
the storage. In this work, we are interested in the problem of
designing the sources.

We distinguish the following two different cases. 

(i) Subset sources case: Each source node only contains an 
uncoded subset of the pieces of the file. 

(ii) Coded sources case: Each source node can contain arbi- 
trary functions of the pieces of the file. 

[Figure: three panels labeled "Total cost = 14+4 = 18", "Total cost = 11+6 = 17", and "Total cost = 10+8 = 18".]

Fig. 1. Cost comparison of three different storage schemes when a document
[a b c d] needs to be transmitted to two terminals. Note that in this example,
the case of partial replication has the lowest cost.



We begin by showing by means of an example that storing
independent data at each source node is not optimal in general,
as illustrated in Figure 1, which is the celebrated butterfly
network. We consider a file represented as (a, b, c, d), where
each of the four components has unit entropy, and a network
where each edge has a capacity of three bits/unit time. The cost
of transmitting at rate $x$ over edge $e$ is $c_e(x) = x$, and the cost of
storage at the sources is 1 per unit storage. As shown in the
figure, the case of partial replication when the source nodes 
contain dependent information has lower cost compared to the 
cases when the source nodes contain independent information 
or identical information (full replication). The case of subset 
sources is interesting for multiple reasons. For example, it may 
be the case that a given terminal is only interested in a part of 
the original file. In this case, if one places coded pieces of the 
original file at the source nodes, then the terminal may need 
to obtain a large number of coded pieces before it can recover 
the part that it is interested in. In the extreme case, if coding 
is performed across all the pieces of the file, then the terminal 
will need to recover all the sources before it can recover the 
part it is interested in. Note however, that in this work we 
do not explicitly consider scenarios where a given terminal 
requires parts of the file. From a theoretical perspective as 
well, it is interesting to examine how much loss one incurs by 
not allowing coding at the sources. 

A. Main Contributions 

1) Formulation of the optimization problems by exploiting
the properties of information measures ([21]): We provide
a precise formulation of the different optimization problems
by leveraging the properties of the information measure (I-measure)
introduced in [21]. This allows us to provide a succinct
formulation of the cost gap between the two cases and allows
us to recover tight results in certain cases.

2) Cost comparison between subset sources case and coded 
sources case: The usage of the properties of information 
measure allows us to conclude that when there are two 
source nodes, there is no loss in considering subset sources. 
Furthermore, in the case of three source nodes, we derive an
upper bound on the cost gap between the two cases that is shown to
be tight. Finally, we propose a greedy algorithm to determine
the cost gap for a given instance.



This paper is organized as follows. In Section II we present
background and related work. Section III outlines basic results
that allow us to apply the theory of I-measures to our problem.
We formulate the precise problems under consideration in
Section IV. The cost gap between the subset case and the
coded case is discussed in Section V, and the simulation results
are presented in Section VI. Section VII concludes the paper.

II. Background and related work 

A. Minimum cost multicast with multiple sources problem 

Several schemes have been proposed for content distribution
over networks as discussed previously ([1][3][8][9][10]). In
this section we briefly overview past work that is most closely 
related to the problem that we are considering. 

Network coding has been used in the area of large scale
content distribution for different purposes. Several design
principles for peer-to-peer streaming systems with network
coding in realistic settings are introduced in [22]. Reference
[10] proposed a content distribution scheme using network
coding in a dynamic environment where nodes cooperate.
A random linear coding based storage system (which is
motivated by random network coding) was considered in [23]
and shown to be more efficient than an uncoded random storage
system. However, their notion of efficiency is different from
the total flow and storage cost considered in our work. The
work of [11] proposed linear programming formulations for
minimum cost flow allocation in network coding based multicast.
Lee et al. [19] constructed minimum cost subgraphs for the
multicast of two correlated sources. They also proposed the
problem of optimizing the correlation structure of sources and
their placement. However, a solution was not presented there.
Efficient algorithms for jointly allocating flows and rates were
proposed for the multicast of a large number of correlated
sources by Ramamoorthy [20] (see [24] for a formulation
where the terminals exhibit selfish behavior). The work of
Jiang [25] considered a formulation that is similar to ours. It
shows that under network coding, the problem of minimizing 
the joint transmission and storage cost can be formulated as 
a linear program. Furthermore, it considers a special class 
of networks called generalized tree networks and shows that 
there is no difference in the cost whether one considers subset 
sources or coded sources. This conclusion is consistent with 
the fact that network coding is not useful in tree settings. In 
contrast, in this work we consider general networks, i.e., we do 
not assume any special structure of the network. We note that
in more recent work [26], network coding based distributed
storage mechanisms and associated research issues have been
outlined.

The work of Bhattad et al. [27] proposed an optimization
problem formulation for cost minimization when some nodes
are only allowed routing and forwarding instead of network
coding. Our work on subset sources can perhaps be considered
as an instance of this problem, by introducing a virtual super
node and only allowing routing/forwarding on it. However,
since we consider a specific instance of this general problem,
as we allow coding at all nodes except the virtual super node,
our problem formulation is much simpler than that of [27] and allows
us to compare the cost of subset sources vs. coded sources.
In [27], the complexity grows as the product of the number
of edges and a factor that is exponential in the number of
terminals. In our method, the number of constraints only grows
linearly with the number of receivers. However, there is a set of
constraints that is exponential in the number of source nodes.
For most networks, we expect our formulation to be more
efficient. In addition, we recover stronger results in the case
when there are only two or three source nodes. Our solution
approach uses the concept of information measures [21], which
has also been used in [28] recently in other contexts.

B. Set theory and information theory 

In this section, we introduce a few basic concepts and useful
theorems that relate to set theory and information theory. More
details can be found in [21].

Definition 1: The field $\mathcal{F}_n$ generated by sets
$\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$ is the collection of sets which can be
obtained by any sequence of usual set operations on
$\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$.

Definition 2: The atoms of $\mathcal{F}_n$ are sets of the form $\cap_{i=1}^{n} Y_i$,
where $Y_i$ is either $\tilde{X}_i$ or $\tilde{X}_i^c$.

Definition 3: A real function $\mu$ defined on $\mathcal{F}_n$ is called a
signed measure if it is set-additive, i.e., for disjoint sets $A$ and
$B$ in $\mathcal{F}_n$, $\mu(A \cup B) = \mu(A) + \mu(B)$.

We use $\mathcal{F}_n$ to denote the field generated by
$\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$. Define the universal set $\Omega$ to be the
union of the sets $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$, i.e., $\Omega = \cup_{i=1}^{n} \tilde{X}_i$. The set
$A_0 = \cap_{i=1}^{n} \tilde{X}_i^c$, whose measure is $\mu(\cap_{i=1}^{n} \tilde{X}_i^c) = \mu(\emptyset) = 0$, is
called the empty atom of $\mathcal{F}_n$. Let $\mathcal{A}$ be the set of nonempty
atoms of $\mathcal{F}_n$ ($|\mathcal{A}| = 2^n - 1$). It can be shown that any set
in $\mathcal{F}_n$ can be uniquely defined as the union of some atoms.
A signed measure $\mu$ on $\mathcal{F}_n$ is completely specified by the
values of $\mu$ on the nonempty atoms of $\mathcal{F}_n$.
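
As a small illustration of these definitions, the following Python sketch enumerates the nonempty atoms of the field generated by n sets. The encoding of an atom as a tuple of booleans and the function names `nonempty_atoms` and `atoms_inside` are illustrative assumptions made here; they are not notation from [21].

```python
from itertools import product


def nonempty_atoms(n):
    """Return the nonempty atoms of the field generated by n sets.

    Each atom is encoded as a tuple of n booleans: entry i is True when the
    atom lies inside the (i+1)-th set and False when it lies inside its
    complement. The all-False pattern is the empty atom and is excluded.
    """
    return [bits for bits in product([True, False], repeat=n) if any(bits)]


def atoms_inside(atoms, i):
    """Atoms contained in the (i+1)-th generating set (0-based index i)."""
    return [a for a in atoms if a[i]]


atoms = nonempty_atoms(3)
print(len(atoms))                   # number of nonempty atoms, 2**3 - 1
print(len(atoms_inside(atoms, 0)))  # atoms contained in the first set, 2**(3-1)
```

Every set in the field is a union of such atoms, which is why a signed measure is fixed once its values on the $2^n - 1$ nonempty atoms are known.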

Consider a field $\mathcal{F}_n$ generated by $n$ sets $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$.
Let $\mathcal{M}_S = \{1, 2, \ldots, n\}$ and let $\tilde{X}_V$ denote $\cup_{i \in V} \tilde{X}_i$
for any nonempty subset $V$ of $\mathcal{M}_S$. Define $\mathcal{B} =
\{\tilde{X}_V : V \text{ is a nonempty subset of } \mathcal{M}_S\}$. According to the
proof of Theorem 3.6 in [21], there is a unique linear relationship
between $\mu(A)$ for $A \in \mathcal{A}$ and $\mu(B)$ for $B \in \mathcal{B}$.
Since a signed measure $\mu$ on $\mathcal{F}_n$ is completely specified by
$\mu(A)$, $A \in \mathcal{A}$, it is also completely specified by $\mu(B)$, $B \in \mathcal{B}$.

For $n$ random variables $X_1, X_2, \ldots, X_n$, let $\tilde{X}_i$ be a set
corresponding to $X_i$. Let $X_V = (X_i, i \in V)$, where $V$ is
some nonempty subset of $\mathcal{M}_S$. We define the signed measure by
$\mu^*(\tilde{X}_V) = H(X_V)$, for all nonempty subsets $V$ of $\mathcal{M}_S$. Then
$\mu^*$ is the unique signed measure on $\mathcal{F}_n$ which is consistent
with all of Shannon's information measures (Theorem 3.9 in [21]).

III. Preliminaries 

In this section we develop some key results that will be used
throughout the paper. In particular, we shall deal extensively
with the I-measure introduced in [21]. We refer the reader to
[21] for the required background in this area. First we note that
it is well known that atom measures can be negative for general
probability distributions [21], e.g., consider three random variables $X_1$,
$X_2$ and $X_3$, where $X_1$ and $X_2$ are independent, $P(X_i = 1) =
P(X_i = 0) = 1/2$, $i = 1, 2$, and $X_3 = (X_1 + X_2) \bmod 2$; then
$\mu^*(\tilde{X}_1 \cap \tilde{X}_2 \cap \tilde{X}_3) = -1$. Next we argue that in order to make
each source node only contain a subset of the pieces of the
file, the measure of the atoms in the fields generated by the
sources should be non-negative. This is stated as a theorem
below.
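
The negative atom in the modulo-2 example can be checked numerically. The sketch below is illustrative only: the helper names `joint_entropy` and `triple_atom` are assumptions of this example, and the triple-intersection atom is evaluated by inclusion-exclusion over the joint entropies, i.e., as $\sum_{V \neq \emptyset} (-1)^{|V|+1} H(X_V)$.

```python
from itertools import combinations, product
from math import log2


def joint_entropy(pmf, idx):
    """Entropy in bits of the marginal of a joint pmf on the coordinates idx."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in idx)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)


def triple_atom(pmf):
    """I-measure of the atom inside all three sets, by inclusion-exclusion."""
    total = 0.0
    for k in (1, 2, 3):
        for idx in combinations(range(3), k):
            total += (-1) ** (k + 1) * joint_entropy(pmf, idx)
    return total


# X1, X2 independent uniform bits, X3 = (X1 + X2) mod 2.
pmf = {(a, b, (a + b) % 2): 0.25 for a, b in product((0, 1), repeat=2)}
print(triple_atom(pmf))  # the measure of the triple-intersection atom
```

Each single variable has entropy 1 bit and every pair or triple has joint entropy 2 bits, so the alternating sum is 3 - 6 + 2 = -1.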

Let $\mathcal{M}_S = \{1, 2, \ldots, n\}$. Consider $n$ random
variables $X_1, X_2, \ldots, X_n$ and their corresponding sets
$\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$. Let $\tilde{X}_V = \cup_{i \in V} \tilde{X}_i$ and $X_V = (X_i, i \in V)$,
$V \subseteq \mathcal{M}_S$. We denote the set of nonempty atoms of $\mathcal{F}_n$ by $\mathcal{A}$,
where $\mathcal{F}_n$ is the field generated by the sets $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$.
Construct the signed measure $\mu^*(\tilde{X}_V) = H(X_V)$, for all
nonempty subsets $V$ of $\mathcal{M}_S$.

Theorem 1: (1) Suppose that there exists a set of $2^n - 1$
nonnegative values, one corresponding to each atom of $\mathcal{F}_n$, i.e.,
$\alpha(A) \geq 0, \forall A \in \mathcal{A}$. Then, we can define a set of independent
random variables $W_A$, $A \in \mathcal{A}$, and construct random variables
$X_j = (W_A : A \in \mathcal{A}, A \subset \tilde{X}_j)$, such that the measures of the
nonempty atoms of the field generated by $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$
correspond to the values of $\alpha$, i.e., $\mu^*(A) = \alpha(A), \forall A \in \mathcal{A}$.
(2) Conversely, let $Z_i$, $i \in \{1, \ldots, m\}$, be a collection of
independent random variables. Suppose that a set of random
variables $X_i$, $i = 1, \ldots, n$, is such that $X_i = Z_{V_i}$, where
$V_i \subseteq \{1, \ldots, m\}$. Then the atoms of the field generated
by $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$ have non-negative measures.
Proof: See Appendix. ■

IV. Problem Formulation 

We now present the precise problem formulations for the
subset sources case and the coded sources case. Suppose that
we are given a directed graph $G = (V, E, C)$ that represents
the network, where $V$ denotes the set of vertices, $E$ denotes the set
of edges, and $C_{ij}$ denotes the capacity of edge $(i,j) \in E$.
There is a set of source nodes $S \subset V$ (numbered $1, \ldots, n$)
and terminal nodes $T \subset V$, such that $|T| = m$. We assume
that the original source, which has a certain entropy, can be
represented as the collection of equal entropy independent
sources $\{OS_j\}_{j=1}^{Q}$, where $Q$ is a sufficiently large integer.
Note that this implies that $H(OS_j)$ can be fractional. Let $X_i$
represent the source at the $i$-th source node. For instance in the
case of subset sources, this represents a subset of $\{OS_j\}_{j=1}^{Q}$
that are available at the $i$-th node. Suppose that each edge $(i,j)$
incurs a linear cost $f_{ij} z_{ij}$ for a flow of value $z_{ij}$ over it, and
each source incurs a linear cost $d_i H(X_i)$ for the information
$X_i$ stored.

A. Subset Sources Case 

1) Basic formulation: In this case each source $X_i$, $i =
1, \ldots, n$, is constrained to be a subset of the pieces of the
original source. We leverage Theorem 1 from the previous
section, which tells us that in this case $\mu^*(A) \geq 0$ for all
$A \in \mathcal{A}$. In the discussion below, we will pose this problem
as one of recovering the measures of the $2^n - 1$ atoms. Note
that this will in general result in fractional values. However,
the solution can be interpreted appropriately because of the
assumptions on the original source. This point is also discussed
in Section IV-B.

Fig. 2. Modified graph for the first formulation when there are three sources.

Fig. 3. Modified graph for the second formulation.

We construct an augmented graph $G_1^* = (V_1^*, E_1^*, C_1^*)$
as follows (see Figure 2). Append a virtual super node $s^*$
and $2^n - 1$ virtual nodes corresponding to the atom sources
$W_A, \forall A \in \mathcal{A}$, and connect $s^*$ to each $W_A$ source node. The
node for $W_A$ is connected to a source node $i \in S$ if $A \subset \tilde{X}_i$.
The capacities of the new (virtual) edges are set to infinity.
The cost of the edge $(s^*, W_A)$ is set to $\sum_{\{i \in S : A \subset \tilde{X}_i\}} d_i$. The
costs of the edges $(W_A, i)$, $A \subset \tilde{X}_i$, are set to zero.

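If each terminal can recover all the atom sources $W_A, \forall A \in
\mathcal{A}$, then it can in turn recover the original source. The
information that needs to be stored at the source node $i \in S$
is equal to the sum of flows from $s^*$ to $W_A, \forall A \subset \tilde{X}_i$. Let
$x_{ij}^{(t)}$, $t \in T$, represent the flow variable over $G_1^*$ corresponding
to the terminal $t$ along edge $(i,j)$ and let $z_{ij}$ represent
$\max_{t \in T} x_{ij}^{(t)}, \forall (i,j) \in E_1^*$. The corresponding optimization
problem is defined as ATOM-SUBSET-MIN-COST.

$$\text{minimize} \quad \sum_{(i,j) \in E} f_{ij} z_{ij} + \sum_{A \in \mathcal{A}} \Big( \sum_{\{i \in S : A \subset \tilde{X}_i\}} d_i \Big) \mu^*(A)$$

subject to

$$\begin{aligned}
& 0 \leq x_{ij}^{(t)} \leq z_{ij} \leq c_{ij,1}^*, && \forall (i,j) \in E_1^*,\ t \in T \\
& \sum_{\{j | (i,j) \in E_1^*\}} x_{ij}^{(t)} - \sum_{\{j | (j,i) \in E_1^*\}} x_{ji}^{(t)} = \sigma_i^{(t)}, && \forall i \in V_1^*,\ t \in T \\
& x_{s^* W_A}^{(t)} = \mu^*(A), && t \in T,\ A \in \mathcal{A} && (1) \\
& \mu^*(A) \geq 0, && \forall A \in \mathcal{A} && (2) \\
& H(X_1, X_2, \ldots, X_n) = \sum_{A : A \in \mathcal{A}} \mu^*(A) && && (3)
\end{aligned}$$

where

$$\sigma_i^{(t)} = \begin{cases} H(X_1, \ldots, X_n) & \text{if } i = s^* \\ -H(X_1, \ldots, X_n) & \text{if } i = t \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$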

This is basically the formulation of the minimum cost multicast
problem [11] with a virtual super-source of entropy
$H(X_1, \ldots, X_n)$, with the added constraint that the flow on
the edge from $s^*$ to node $W_A$ for each terminal, $x_{s^* W_A}^{(t)}$, is at
least $\mu^*(A)$. We also have a constraint that $\sum_{A \in \mathcal{A}} \mu^*(A) =
H(X_1, X_2, \ldots, X_n)$, which in turn yields the constraint that
$x_{s^* W_A}^{(t)} = \mu^*(A)$. Also, note that the measure of each atom,
$\mu^*(A)$, is non-negative. This enforces the subset constraints:
from the non-negative measures of the atoms, we
are able to construct random variables, which indicates that the
atom measures satisfy both Shannon type inequalities and
non-Shannon type inequalities. Hence, the non-negative atom
measures ensure that the corresponding entropic vectors are in
the entropy region.

In general, the proposed LP formulation has a number of
constraints that is exponential in the number of source nodes,
since there are $2^n - 1$ atoms. However, when the number of
source nodes is small, this formulation can be solved using 
regular LP solvers. We emphasize, though, that the formulation 
of this problem in terms of the atoms of the distribution of 
the sources provides us with a mechanism for reasoning about 
the case of subset constraints, under network coding. We are 
unaware of previous work that proposes a formulation of this 
problem. 

In order to provide bounds on the gap between the optimal 
costs of the subset sources case and the coded sources case, 
we now present an alternate formulation of this optimization, 
that is more amenable to gap analysis. Note however, that 
this alternate formulation has more constraints than the one 
presented above. 

2) Another formulation: In the first formulation, the terminals
first recover the atom sources, and then the original
source. In this alternate formulation, we pose the problem as
one of first recovering all the sources $X_i$, $i \in S$, at each
terminal and then the original source. Note that since these
sources are correlated, this formulation is equivalent to the
Slepian-Wolf problem over a network [20]. We shall first
give the problem formulation and then prove that the two
formulations have the same optimal values.

We construct another augmented graph $G_2^* = (V_2^*, E_2^*, C_2^*)$
(see Figure 3) using the basic network graph $G = (V, E, C)$.
We append a virtual super node $s^*$ to $G$, and connect $s^*$ to
each source node $i$ with a virtual edge whose capacity
is infinity and whose cost is $d_i$.

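As before, let $x_{ij}^{(t)}$, $t \in T$, represent the flow variable over
$G_2^*$ corresponding to the terminal $t$ along edge $(i,j)$ and let
$z_{ij}$ represent $\max_{t \in T} x_{ij}^{(t)}, \forall (i,j) \in E_2^*$. We introduce the variable
$R_i^{(t)}$, $t \in T$, that represents the rate from source $i$ to terminal
$t$, $i = 1, \ldots, n$. Thus $R^{(t)} = (R_1^{(t)}, R_2^{(t)}, \ldots, R_n^{(t)})$ represents
the rate vector for terminal $t$. In order for $t$ to recover the
sources [29], the rate vector $R^{(t)}$ needs to lie within the
Slepian-Wolf region of the sources

$$\mathcal{R}_{SW} = \Big\{ (R_1, \ldots, R_n) : \forall U \subseteq S,\ \sum_{i \in U} R_i \geq H(X_U | X_{S \setminus U}) \Big\}.$$

Moreover, the rates also need to be in the capacity region
such that the network has enough capacity to support them
for each terminal. As before we enforce the subset constraint
$\mu^*(A) \geq 0, \forall A \in \mathcal{A}$. The optimization problem is defined as
SUBSET-MIN-COST.

$$\text{minimize} \quad \sum_{(i,j) \in E} f_{ij} z_{ij} + \sum_{A \in \mathcal{A}} \Big( \sum_{\{i \in S : A \subset \tilde{X}_i\}} d_i \Big) \mu^*(A)$$

subject to

$$\begin{aligned}
& 0 \leq x_{ij}^{(t)} \leq z_{ij} \leq c_{ij,2}^*, && \forall (i,j) \in E_2^*,\ t \in T \\
& \sum_{\{j | (i,j) \in E_2^*\}} x_{ij}^{(t)} - \sum_{\{j | (j,i) \in E_2^*\}} x_{ji}^{(t)} = \sigma_i^{(t)}, && \forall i \in V_2^*,\ t \in T && (5) \\
& x_{s^* i}^{(t)} \geq R_i^{(t)}, && \forall i \in S,\ t \in T && (6) \\
& R^{(t)} \in \mathcal{R}_{SW}, && \forall t \in T && (7) \\
& \mu^*(A) \geq 0, && \forall A \in \mathcal{A} && (8) \\
& z_{s^* i} = H(X_i) = \sum_{A : A \in \mathcal{A}, A \subset \tilde{X}_i} \mu^*(A), && \forall i \in S && (9) \\
& H(X_1, X_2, \ldots, X_n) = \sum_{A \in \mathcal{A}} \mu^*(A) && && (10) \\
& H(X_U | X_{S \setminus U}) = \sum_{A : A \in \mathcal{A}, A \not\subset \tilde{X}_{S \setminus U}} \mu^*(A), && \forall U \subseteq S && (11)
\end{aligned}$$

where $\sigma_i^{(t)}$ is defined in (4).

Now we prove that the two formulations have the same
optimal values. The basic idea is as follows. Note that the
objective functions for both the formulations are exactly the
same. We shall first consider the optimal solution for the
first formulation and construct a solution for the second
formulation so that we can conclude that $f_{opt1} \geq f_{opt2}$. In
a similar manner we will obtain the reverse inequality, which
will establish equality of the two optimal values.

Suppose that we are given the optimal set of flows
$x_{ij,1}^{(t)}, z_{ij,1}$, $t \in T$, $(i,j) \in E_1^*$, and the optimal atom values
$\mu^*(A)_1$ for the first formulation, with an objective of value
$f_{opt1}$.

Claim 1: In $G_2^*$, for the flows $x_{ij,2}^{(t)}, z_{ij,2}$ and the atoms
$\mu^*(A)_2$, assign

$$\begin{aligned}
& x_{ij,2}^{(t)} = x_{ij,1}^{(t)}, \quad z_{ij,2} = z_{ij,1}, \quad \forall (i,j) \in G, \\
& x_{s^* i,2}^{(t)} = R_{i,2}^{(t)} = \sum_{A : A \in \mathcal{A}, A \subset \tilde{X}_i} x_{W_A i,1}^{(t)}, \quad z_{s^* i,2} = \sum_{A : A \in \mathcal{A}, A \subset \tilde{X}_i} \mu^*(A)_1, \\
& \mu^*(A)_2 = \mu^*(A)_1, \quad \forall A \in \mathcal{A}.
\end{aligned}$$

Then $R_{i,2}^{(t)}$, $x_{ij,2}^{(t)}$, $z_{ij,2}$ and the atoms $\mu^*(A)_2$ are a feasible
solution for the second formulation.

Proof. Flow balance for source node $i \in S$ in the first formulation
implies that $\sum_{A : A \in \mathcal{A}, A \subset \tilde{X}_i} x_{W_A i,1}^{(t)} = \sum_{j : (i,j) \in E_1^*} x_{ij,1}^{(t)}$,
$\forall i \in S$. Therefore flow balance for source node
$i$ in the second formulation can be seen as follows:

$$x_{s^* i,2}^{(t)} = \sum_{A : A \in \mathcal{A}, A \subset \tilde{X}_i} x_{W_A i,1}^{(t)} = \sum_{j : (i,j) \in E_1^*} x_{ij,1}^{(t)} = \sum_{j : (i,j) \in E_2^*} x_{ij,2}^{(t)}.$$

Flow balance at the internal nodes
is trivially satisfied. We only need to check constraints (6) and (7).

In the equations below, we use $A \in \mathcal{A}$ (i.e., $A$ is an atom) as
a summation index at various terms. However, for notational
simplicity, we do not explicitly include the qualifier $A \in \mathcal{A}$
below. Also in the equations, we have the convention that if
there is no edge between nodes $W_A$ and $i$ in $G_1^*$, the flow
$x_{W_A i,1}^{(t)}$ is zero. For any $U \subseteq S$, we have

$$\begin{aligned}
\sum_{i \in U} R_{i,2}^{(t)} &= \sum_{i \in U} \sum_{A : A \subset \tilde{X}_i} x_{W_A i,1}^{(t)} \\
&\overset{(a)}{=} \sum_{i \in U} \sum_{A : A \not\subset \tilde{X}_{S \setminus U}, A \subset \tilde{X}_U} x_{W_A i,1}^{(t)} + \sum_{i \in U} \sum_{A : A \subset \tilde{X}_{S \setminus U}, A \subset \tilde{X}_U} x_{W_A i,1}^{(t)} \\
&\geq \sum_{i \in U} \sum_{A : A \not\subset \tilde{X}_{S \setminus U}, A \subset \tilde{X}_U} x_{W_A i,1}^{(t)} = \sum_{A : A \not\subset \tilde{X}_{S \setminus U}, A \subset \tilde{X}_U} \sum_{i \in U} x_{W_A i,1}^{(t)} \\
&\overset{(b)}{=} \sum_{A : A \not\subset \tilde{X}_{S \setminus U}, A \subset \tilde{X}_U} x_{s^* W_A,1}^{(t)} \\
&\overset{(c)}{=} \sum_{A : A \not\subset \tilde{X}_{S \setminus U}, A \subset \tilde{X}_U} \mu^*(A)_2 = H(X_U | X_{S \setminus U}) \qquad (12)
\end{aligned}$$

where $H(X_U | X_{S \setminus U})$ is the conditional entropy of the second
formulation, (a) is due to the convention we defined above, (b)
is from the flow balance at the atom node and the convention
we defined above, and (c) comes from the constraint (1) in the first
formulation. Therefore, constraints (6) and (7) are satisfied and
this assignment is feasible for the second formulation with a
cost equal to $f_{opt1}$. ■

We conclude that the optimal value of the second formulation
satisfies $f_{opt2} \leq f_{opt1}$.

Next we show the inequality in the reverse direction. Suppose
that we are given the optimal set of flows $x_{ij,2}^{(t)}, z_{ij,2}$, $t \in
T$, $(i,j) \in E_2^*$, and the atom values $\mu^*(A)_2$ in the second
formulation. Further assume that the optimal objective function value
is $f_{opt2}$.

Claim 2: In $G_1^*$, assign

$$\begin{aligned}
& x_{ij,1}^{(t)} = x_{ij,2}^{(t)}, \quad z_{ij,1} = z_{ij,2}, \quad \forall (i,j) \in G, \\
& z_{s^* W_A,1} = x_{s^* W_A,1}^{(t)} = \mu^*(A)_1 = \mu^*(A)_2, \quad \forall A \in \mathcal{A}.
\end{aligned}$$

Furthermore, there exist flow variables $x_{W_A i,1}^{(t)}$ and $z_{W_A i,1}$
over the edges $(W_A, i) \in E_1^*$, $\forall A \in \mathcal{A}$, such that together
with the assignment, they form a feasible solution for the first
formulation.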




Fig. 4. An example of the graph constructed for the proof of Claim 2, where
there are three sources.

Proof. It is clear that the assignments for $x_{ij,1}^{(t)}$ and $z_{ij,1}$ for
$(i,j) \in G$ satisfy the required flow balance constraints. We
need to demonstrate the existence of flow variables $x_{W_A i,1}^{(t)}$
and $z_{W_A i,1}$ over the edges $(W_A, i) \in E_1^*$, $\forall A \in \mathcal{A}$, such that
they satisfy the flow balance constraints.

Towards this end it is convenient to construct an auxiliary
graph as follows. There is a source node $P^*$ connected to
the atoms $W_A$, $A \in \mathcal{A}$, and a terminal $Q^*$ connected to the
source nodes $i \in S$. There is an edge connecting $W_A$ and $i$
if $A \subset \tilde{X}_i$. An example is shown in Figure 4. The capacity
for edge $(P^*, W_A)$ is $x_{s^* W_A,1}^{(t)}$, the capacity for edge $(i, Q^*)$
is $x_{s^* i,2}^{(t)}$, and the capacity for edge $(W_A, i)$ is infinity. Note
that $\sum_{A \in \mathcal{A}} x_{s^* W_A,1}^{(t)} = \sum_{i \in S} x_{s^* i,2}^{(t)} = H(X_1, X_2, \ldots, X_n)$.
Therefore, if we can show that the maximum flow in this
auxiliary graph between $P^*$ and $Q^*$ is $H(X_1, X_2, \ldots, X_n)$,
this would imply the existence of flow variables on the edges
between the atom nodes and the source nodes that satisfy the
required flow balance conditions.

To show this we use the max-flow min-cut theorem [30] and
instead show that the minimum value over all cuts separating
$P^*$ and $Q^*$ is $H(X_1, X_2, \ldots, X_n)$.

First, notice that there is a cut with value
$H(X_1, X_2, \ldots, X_n)$. This cut can be simply the node
$P^*$, since the sum of the capacities of its outgoing edges is
$H(X_1, X_2, \ldots, X_n)$. Next, if an atom node $W_A$ belongs to
the cut that contains $P^*$, then we must have all source nodes
$i \in S$ such that $A \subset \tilde{X}_i$ also belonging to the cut. To see
this, note that otherwise there is at least one edge crossing
the cut whose capacity is infinity, i.e., the cut cannot be the
minimum cut.

Let $S' \subseteq S$. Based on this argument it suffices to consider
cuts that contain $P^*$, the set of nodes $S \setminus S'$ and the set of
all atoms $W_A$ such that $A \not\subset \tilde{X}_{S'}$. The value of this cut is at
least

$$\begin{aligned}
& \sum_{A : A \in \mathcal{A}, A \subset \tilde{X}_{S'}} x_{s^* W_A,1}^{(t)} + \sum_{i \in S \setminus S'} x_{s^* i,2}^{(t)} \\
&= H(X_1, \ldots, X_n) - \sum_{A : A \in \mathcal{A}, A \not\subset \tilde{X}_{S'}} x_{s^* W_A,1}^{(t)} + \sum_{i \in S \setminus S'} x_{s^* i,2}^{(t)}.
\end{aligned}$$

By constraints (6), (7) and the given assignment,
we have $\sum_{A : A \in \mathcal{A}, A \not\subset \tilde{X}_{S'}} x_{s^* W_A,1}^{(t)} = H(X_{S \setminus S'} | X_{S'}) \leq
\sum_{i \in S \setminus S'} x_{s^* i,2}^{(t)}$. This implies that the value of any cut of this
form is at least $H(X_1, X_2, \ldots, X_n)$. Therefore we can conclude
that the minimum cut over all cuts separating $P^*$ and $Q^*$ is
exactly $H(X_1, X_2, \ldots, X_n)$, i.e., our assignment is a valid
solution. ■
Using Claims 1 and 2 we conclude that $f_{opt1} = f_{opt2}$.
As mentioned earlier, the second formulation will be useful
when we compute the cost gap between the coded and subset
cases; we will use the graph $G^* = G_2^*$ in the rest of the paper.
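
The max-flow argument in the proof of Claim 2 can be checked numerically on a toy instance. The sketch below is illustrative only: the two-source atom measures, the rates, the node labels "P" and "Q", and the helpers `max_flow` (a plain Edmonds-Karp routine) and `auxiliary_graph` are assumptions made for this example and are not part of the formulation. Atoms are keyed by the set of sources that contain them.

```python
from collections import deque

INF = float("inf")


def max_flow(cap, s, t):
    """Edmonds-Karp maximum flow; cap maps a directed edge (u, v) to its capacity."""
    res = dict(cap)                      # residual capacities
    for (u, v) in cap:
        res.setdefault((v, u), 0)        # add reverse edges
    adj = {}
    for (u, v) in res:
        adj.setdefault(u, []).append(v)
    flow = 0
    while True:
        parent = {s: None}               # breadth-first search for a shortest path
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, []):
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[e] for e in path)  # bottleneck along the path
        for (u, v) in path:
            res[(u, v)] -= push
            res[(v, u)] += push
        flow += push


def auxiliary_graph(mu, rate):
    """P -> atom (capacity mu), atom -> source (infinite), source -> Q (capacity rate)."""
    cap = {}
    for atom, m in mu.items():
        cap[("P", atom)] = m
        for i in atom:
            cap[(atom, i)] = INF
    for i, r in rate.items():
        cap[(i, "Q")] = r
    return cap


# Two sources with joint entropy 8, H(X1|X2) = H(X2|X1) = 2 (units are arbitrary).
mu = {frozenset({1}): 2, frozenset({2}): 2, frozenset({1, 2}): 4}
good = {1: 5, 2: 3}   # satisfies the Slepian-Wolf constraints
bad = {1: 7, 2: 1}    # violates R_2 >= H(X2 | X1)
print(max_flow(auxiliary_graph(mu, good), "P", "Q"))
print(max_flow(auxiliary_graph(mu, bad), "P", "Q"))
```

For the rates inside the Slepian-Wolf region the maximum flow equals the joint entropy, so a consistent assignment of flows on the atom-to-source edges exists; for the violating rates the flow falls short, because the atom contained only in the second source cannot push its full measure through.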

B. Solution explanation and construction 

Assume that we solve the above problem and obtain the
values of all the atoms $\mu^*(A)$, $A \in \mathcal{A}$. These will in general be
fractional. We now outline the algorithm that decides the content
of each source node. We use the assumption that the original
source can be represented as a collection of independent
equal-entropy random variables $\{OS_i\}_{i=1}^{Q}$, for large enough
$Q$, at this point. Suppose that $H(OS_1) = \beta$. In turn, we can
conclude that there exist integers $\alpha_A$, $\forall A \in \mathcal{A}$, such that $\alpha_A \times
\beta = \mu^*(A)$, $\forall A \in \mathcal{A}$, and that $\sum_{A \in \mathcal{A}} \alpha_A = Q$. Consider an
ordering of the atoms, denoted as $A_1, A_2, \ldots, A_{2^n - 1}$. The atom
sources can then be assigned as follows: for each $A_i$, assign

$$W_{A_i} = \big( OS_{\sum_{j<i} \alpha_{A_j} + 1},\ OS_{\sum_{j<i} \alpha_{A_j} + 2},\ \ldots,\ OS_{\sum_{j \leq i} \alpha_{A_j}} \big).$$

It is clear that the resultant atom sources are independent and that
$H(W_A) = \mu^*(A)$, $\forall A \in \mathcal{A}$. Now set $X_i = (W_A : A \subset \tilde{X}_i)$
to obtain the sources at each node.
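
The assignment just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: atoms are keyed by the set of source indices that contain them, the integer counts play the role of the multiples of $\beta$, and the function name `build_sources` and the particular atom ordering are hypothetical choices made for the example.

```python
def build_sources(n, counts):
    """Assign consecutive unit pieces to atoms, then collect them per source.

    counts maps an atom (a frozenset of the source indices whose sets contain
    the atom) to the integer number of unit pieces given to that atom.
    Returns (atom_pieces, sources), both holding 1-based piece indices.
    """
    atom_pieces = {}
    next_piece = 1
    # Any fixed ordering of the atoms works; this one sorts by size, then index.
    for atom in sorted(counts, key=lambda a: (len(a), sorted(a))):
        a = counts[atom]
        atom_pieces[atom] = list(range(next_piece, next_piece + a))
        next_piece += a
    sources = {i: [] for i in range(1, n + 1)}
    for atom, pieces in atom_pieces.items():
        for i in atom:                   # source i stores every atom inside its set
            sources[i].extend(pieces)
    return atom_pieces, sources


counts = {frozenset({1}): 1, frozenset({2}): 1, frozenset({1, 2}): 2}
atom_pieces, sources = build_sources(2, counts)
print(sources)
```

In this toy instance with Q = 4 pieces, source 1 stores pieces 1, 3, 4 and source 2 stores pieces 2, 3, 4, so the two sources share the pieces of the common atom and each holds one private piece.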

The assumption on the original source is essentially equivalent
to saying that a large file can be subdivided into arbitrarily
small pieces. To see this, assume that each edge in the network
has a capacity of 1000 bits/sec. At this time-scale, suppose that
we treat each edge as unit-capacity. If the smallest unit of a
file is a single bit, then we can consider it to consist of
sources of individual entropy equal to $10^{-3}$.

C. Coded source network 

Given the same network, if we allow coded information to
be stored at the sources, using the augmented graph $G^*$ from the
second problem formulation, the storage at the sources can be
viewed as the transmission along the edges connecting the
virtual source and real sources. Then the problem becomes
the standard minimum cost multicast with network coding
problem (CODED-MIN-COST) [11] where the variables are
only the flows $z_{ij}$ and $x_{ij}^{(t)}$.

$$\text{minimize} \quad \sum_{(i,j) \in E} f_{ij} z_{ij} + \sum_{i \in S} d_i z_{s^* i}$$

subject to

$$\begin{aligned}
& 0 \leq x_{ij}^{(t)} \leq z_{ij} \leq c_{ij}^*, && (i,j) \in E^*,\ t \in T \\
& \sum_{\{j | (i,j) \in E^*\}} x_{ij}^{(t)} - \sum_{\{j | (j,i) \in E^*\}} x_{ji}^{(t)} = \sigma_i^{(t)}, && i \in V^*,\ t \in T
\end{aligned}$$

where $\sigma_i^{(t)}$ is defined in (4). Assuming we have the solution for
CODED-MIN-COST, we can use the random coding scheme
introduced by [29] or other deterministic coding schemes [31]
to reconstruct the sources and the information flow of each
edge.
V. Cost comparison between the coded case and subset case

For given instances of the problem, we can certainly compute
the cost gap by solving the corresponding optimization
problems SUBSET-MIN-COST and CODED-MIN-COST. In
this section we formulate an alternate version of CODED-MIN-COST
where we also seek to obtain the values of
the atom measures of the sources (as we did for SUBSET-MIN-COST).
In principle, this requires us to ensure that the
atom measures satisfy the information inequalities [21], which
consist of Shannon type inequalities and, when $n \geq 4$, non-Shannon type
inequalities. In reference [32], it was shown
that there are infinitely many non-Shannon type inequalities
when $n \geq 4$. Hence, it is impossible to list all the information
inequalities when there are four or more sources. Moreover,
since the entropic region is not polyhedral, the problem is
no longer an LP. In our optimization we only enforce the
Shannon inequalities and remove the non-negativity constraint
on the atom measures. In general, these atom measures may
not correspond to the distribution of an actual coded solution.
However, as explained below, starting with an output of our
LP, we find a feasible instance for the SUBSET-MIN-COST
problem and then arrive at an upper bound on the gap.

In the general case of $n$ sources, even this optimization
has a number of constraints that is exponential in $n$. However,
this formulation still has advantages. In particular, we are able to
provide a greedy algorithm to find near-optimal solutions for
it. Moreover, we are able to prove that this greedy algorithm
allows us to determine an upper bound in the case of three
sources, which can be shown to be tight, i.e., there exists a
network topology such that the cost gap is met with equality.

A. Analysis of the gap between the coded case and the subset 
case 

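We now present the problem formulation for
ATOM-CODED-MIN-COST. We use the augmented graph $G^*$ in Figure 3.

minimize $\sum_{(i,j) \in E} f_{ij} z_{ij} + \sum_{i \in S} d_i z_{s^*i}$
subject to
$0 \le x_{ij}^{(t)} \le z_{ij} \le c^*_{ij}, \quad \forall (i,j) \in E^*, t \in T$
$\sum_{\{j \mid (i,j) \in E^*\}} x_{ij}^{(t)} - \sum_{\{j \mid (j,i) \in E^*\}} x_{ji}^{(t)} = \sigma_i^{(t)}, \quad \forall i \in V^*, t \in T$    (13)
$x_{s^*i}^{(t)} \ge R_i^{(t)}, \quad \forall i \in S, t \in T$    (14)
$R^{(t)} \in \mathcal{R}_{SW}, \quad \forall t \in T$    (15)
$H(X_i \mid X_{S \setminus \{i\}}) \ge 0, \quad \forall i \in S$    (16)
$I(X_i; X_j \mid X_K) \ge 0, \quad \forall i \in S, j \in S, i \ne j, K \subseteq S \setminus \{i,j\}$    (17)
$z_{s^*i} = H(X_i), \forall i \in S; \quad H(X_1, X_2, \dots, X_n) = \sum_{A \in \mathcal{A}} \mu^*(A)$    (18)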

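where $\sigma_i^{(t)}$ is defined as before. The formulation is the same as
SUBSET-MIN-COST except that we remove the non-negativity
constraint on the atom measures and add (16) and (17), which are
elemental inequalities that guarantee that all Shannon type
inequalities are satisfied [21]. The constraints in (16) and (17)
can be represented in the form of atoms:

$H(X_i \mid X_{S \setminus \{i\}}) = \mu^*(A), \quad A \in \mathcal{A}, A \not\subset \tilde{X}_{S \setminus \{i\}}$
$I(X_i; X_j \mid X_K) = \sum_{A \in \mathcal{A}: A \subset \tilde{X}_i, A \subset \tilde{X}_j, A \not\subset \tilde{X}_K} \mu^*(A)$

where $K \subseteq S \setminus \{i,j\}$.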
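
The atom form of the elemental inequalities is easy to evaluate once each atom is indexed by the set of sources whose set variable contains it. The sketch below assumes that representation (a dictionary keyed by `frozenset` of source labels), which is an illustrative choice and not notation from the paper; the helper names are hypothetical.

```python
from itertools import combinations

def cond_mutual_info(mu, i, j, K):
    """I(X_i; X_j | X_K): atoms inside X_i and X_j but outside every X_k, k in K."""
    K = set(K)
    return sum(v for A, v in mu.items() if i in A and j in A and not (A & K))

def elemental_inequalities_hold(mu, sources, tol=1e-9):
    """Check the elemental inequalities (16) and (17) for an atom assignment mu."""
    sources = list(sources)
    # (16): H(X_i | all other sources) is the single atom contained only in X_i.
    for i in sources:
        if mu.get(frozenset([i]), 0) < -tol:
            return False
    # (17): I(X_i; X_j | X_K) for every pair and every K avoiding the pair.
    for i, j in combinations(sources, 2):
        rest = [k for k in sources if k not in (i, j)]
        for r in range(len(rest) + 1):
            for K in combinations(rest, r):
                if cond_mutual_info(mu, i, j, K) < -tol:
                    return False
    return True
```

Note that the number of inequalities checked grows exponentially with the number of sources, in line with the remark above about the size of the formulation.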

Now we prove that ATOM-CODED-MIN-COST and
CODED-MIN-COST have the same optimum. Let
the optimum of ATOM-CODED-MIN-COST (CODED-MIN-COST)
be $f_{opta}$ ($f_{optc}$). Denote
$ConA$ = {the set of constraints of ATOM-CODED-MIN-COST} and
$ConC$ = {the set of constraints of CODED-MIN-COST}.
First we note that the two LPs have the same objective
function, and $ConC \subset ConA$. Therefore, we should
have $f_{opta} \ge f_{optc}$. Next we note that $\mu^*(A), A \in \mathcal{A}$
are variables that appear only in $ConA \setminus ConC$, i.e., in (14)-(18).
Let the optimal set of flows for CODED-MIN-COST be
denoted as $x_{ij,c}^{(t)}, z_{ij,c}, t \in T, (i,j) \in E^*$. Now suppose that
$f_{opta} > f_{optc}$. Then this flow assignment is infeasible for
ATOM-CODED-MIN-COST, since otherwise it would achieve the
cost $f_{optc} < f_{opta}$. Next, since $ConC \subset ConA$, the constraints
that cause infeasibility have to be in (14)-(18). This implies
that a feasible $\mu^*(A), A \in \mathcal{A}$ cannot be found.

We claim that this is a contradiction. If coding is allowed
at the source, then there exists a deterministic
algorithm [31] for the multicast network code assignment with
a virtual source connected to all the source nodes that operates
on the subgraph induced by $z_{ij,c}, (i,j) \in E^*$. This algorithm
guarantees the existence of random variables $X_1, \dots, X_n$ that
correspond to the sources. This in turn implies the existence
of atom measures that satisfy all information inequalities
corresponding to the flow assignment $z_{ij,c}, (i,j) \in E^*$. In the
above LP, we have only enforced the elemental inequalities;
therefore the existence of $\mu^*(A), A \in \mathcal{A}$ is guaranteed.

Now, suppose that we know the optimal solution of the above
optimization problem, i.e., the flows $x_{ij,1}^{(t)}, z_{ij,1}, t \in T, (i,j) \in E^*$,
the measures of the atoms $\mu^*(A)_1, \forall A \in \mathcal{A}$, and the
corresponding conditional entropies $H^1(X_U \mid X_{S \setminus U}), \forall U \subseteq S$.
If we can construct a feasible solution for SUBSET-MIN-COST
such that the flows over $E^*$ are the same as
$x_{ij,1}^{(t)}$ (and $z_{ij,1}$), $t \in T, (i,j) \in E$, then we can arrive at an
upper bound for the gap. This is done below.

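Let $\mu^*(A)_2$ denote the variables for the atom measures for
the subset case. The gap LP is,

minimize
$\sum_{A \in \mathcal{A}} \left( \sum_{\{i \in S: A \subset \tilde{X}_i\}} d_i \right) \mu^*(A)_2 - \sum_{A \in \mathcal{A}} \left( \sum_{\{i \in S: A \subset \tilde{X}_i\}} d_i \right) \mu^*(A)_1$
subject to
$\sum_{A: A \in \mathcal{A}, A \not\subset \tilde{X}_{S \setminus U}} \mu^*(A)_2 \le H^1(X_U \mid X_{S \setminus U}), \quad \forall U \subseteq S$    (19)
$\mu^*(A)_2 \ge 0, \quad \forall A \in \mathcal{A}$
$\sum_{A: A \in \mathcal{A}} \mu^*(A)_2 = H(X_1, X_2, \dots, X_n)$

where $H^1(X_U \mid X_{S \setminus U}) = \sum_{A: A \not\subset \tilde{X}_{S \setminus U}} \mu^*(A)_1, \forall U \subseteq S$.
In SUBSET-MIN-COST, we assign $x_{ij,2}^{(t)} = x_{ij,1}^{(t)}, (i,j) \in E^*$,
$z_{ij,2} = z_{ij,1}, (i,j) \in E$ and
$z_{s^*i,2} = \sum_{A \in \mathcal{A}: A \subset \tilde{X}_i} \mu^*(A)_2, \forall i \in S$.
To see that this is feasible, note that

$z_{s^*i,2} = \sum_{A: A \in \mathcal{A}, A \subset \tilde{X}_i} \mu^*(A)_2$
$= H(X_1, \dots, X_n) - H^2(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n \mid X_i)$
$\overset{(a)}{\ge} H(X_1, \dots, X_n) - H^1(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n \mid X_i)$
$= H^1(X_i) = z_{s^*i,1}$
$\ge x_{s^*i,1}^{(t)} = x_{s^*i,2}^{(t)}.$

This implies that the constraint $x_{s^*i,2}^{(t)} \le z_{s^*i,2}$ is satisfied.
Moreover, for all $U \subseteq S$,

$\sum_{i: i \in U} x_{s^*i,2}^{(t)} = \sum_{i: i \in U} x_{s^*i,1}^{(t)} \ge H^1(X_U \mid X_{S \setminus U}) \overset{(b)}{\ge} H^2(X_U \mid X_{S \setminus U})$

where $H^2(X_U \mid X_{S \setminus U}) = \sum_{A: A \not\subset \tilde{X}_{S \setminus U}} \mu^*(A)_2, \forall U \subseteq S$.
Then the rate constraints corresponding to (14) and (15) are satisfied.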

Both (a) and (b) come from constraint (19). It is possible that
the atom measures from ATOM-CODED-MIN-COST are not
valid since they may not satisfy the non-Shannon inequalities.
However, we claim that the solution of the gap LP is still
an upper bound on the difference between the coded and
the subset case. This is because (i) we have constructed
a feasible solution for SUBSET-MIN-COST starting with
$\mu^*(A)_1, \forall A \in \mathcal{A}$, and (ii), as argued above, the optimal values
of CODED-MIN-COST and ATOM-CODED-MIN-COST are
the same. The difference between the costs in the coded case
and the subset case is only due to the different storage costs,
since the flows in both cases are the same. Therefore, the
objective function of the gap LP is a valid upper bound on the
gap.

B. Greedy Algorithm 

We present a greedy algorithm for the gap LP that returns a
feasible, near-optimal solution, and hence serves as an upper
bound on the gap. The main idea is to start by saturating the atoms
with the lowest costs, while still remaining feasible. For
instance, suppose that source 1 has the smallest cost. Then, the
atom $\tilde{X}_1 \cap_{j \in S \setminus \{1\}} \tilde{X}_j^c$ has the least cost, and therefore we
assign it the maximum value possible, i.e., $H^1(X_1 \mid X_{S \setminus \{1\}})$.
Further assignments are made similarly in a greedy fashion.
More precisely, we follow the steps given below.

1) Initialize $\mu^*(A)_2 = 0, \forall A \in \mathcal{A}$. Label all atoms as
"unassigned".

2) If all atoms have been assigned, STOP. Otherwise, let
$A_{min}$ denote the atom with the minimum cost that is
still unassigned.

• Set $\mu^*(A_{min})_2$ as large as possible so that the
sum of the values of all assigned atoms does not
violate any constraint in (19).

• Check to see whether $\sum_{A \in \mathcal{A}} \mu^*(A)_2 > H(X_1, X_2, \dots, X_n)$.
If YES, then reduce the value of $\mu^*(A_{min})_2$, so that
$\sum_{A \in \mathcal{A}} \mu^*(A)_2 = H(X_1, X_2, \dots, X_n)$ and STOP. If NO, then label
$A_{min}$ as "assigned".

3) Go to step 2.

It is clear that this algorithm returns a feasible set of atom
values, since we maintain feasibility at all times and enforce
the sum of the atom values to be $H(X_1, X_2, \dots, X_n)$.
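
The steps above can be sketched in a few lines of Python. This is an illustrative reading of the procedure rather than the authors' code: atoms are indexed by the set of sources containing them, ties in cost are broken by source label, and the names `greedy_gap`, `mu1` (coded-case atoms) and `d` (storage costs) are assumptions.

```python
from itertools import combinations

def nonempty_subsets(sources):
    s = sorted(sources)
    return [frozenset(c) for r in range(1, len(s) + 1) for c in combinations(s, r)]

def greedy_gap(mu1, d, tol=1e-12):
    """Greedy assignment of subset-case atoms; returns (mu2, gap upper bound)."""
    atoms = nonempty_subsets(d)
    h = sum(mu1.get(A, 0) for A in atoms)
    # Right-hand sides of (19): H^1(X_U | X_{S minus U}) is the sum of the
    # coded-case atoms whose member set lies inside U.
    cap = {U: sum(mu1.get(A, 0) for A in atoms if A <= U) for U in atoms}
    cost = {A: sum(d[i] for i in A) for A in atoms}
    mu2 = {A: 0 for A in atoms}
    total = 0
    for A in sorted(atoms, key=lambda A: (cost[A], sorted(A))):
        # Largest value keeping every constraint (19) that involves A satisfied.
        room = min(cap[U] - sum(mu2[B] for B in atoms if B <= U)
                   for U in atoms if A <= U)
        value = max(0, min(room, h - total))  # never exceed the joint entropy
        mu2[A] = value
        total += value
        if total >= h - tol:
            break
    gap = sum(cost[A] * (mu2[A] - mu1.get(A, 0)) for A in atoms)
    return mu2, gap
```

Capping each value at `h - total` merges the "set as large as possible" and "reduce if the sum exceeds the joint entropy" steps into one assignment.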

The greedy algorithm, though suboptimal, does give the
exact gap in many cases that we tested. Moreover, as discussed
next, the greedy approach allows us to arrive at a closed form
expression for an upper bound on the gap in the case of
three sources. However, it is not clear if there is a constant
factor approximation for the greedy algorithm.

C. Three sources case 

The case of three sources is special because (i) Shannon
type inequalities suffice to describe the entropic region, i.e.,
non-Shannon type inequalities do not exist for three random
variables. This implies that we can find three random variables
using the atom measures found by the solution of
ATOM-CODED-MIN-COST. (ii) Moreover, there is at most one atom
measure, $\mu^*(\tilde{X}_1 \cap \tilde{X}_2 \cap \tilde{X}_3)$, that can be negative. This makes
the analysis easier since the greedy algorithm proposed above
can be applied to obtain the required bound. Let
$b = \mu^*(\tilde{X}_1 \cap \tilde{X}_2 \cap \tilde{X}_3)$,
$a_1 = \mu^*(\tilde{X}_1 \cap \tilde{X}_2^c \cap \tilde{X}_3^c)$,
$a_2 = \mu^*(\tilde{X}_2 \cap \tilde{X}_1^c \cap \tilde{X}_3^c)$,
$a_3 = \mu^*(\tilde{X}_3 \cap \tilde{X}_1^c \cap \tilde{X}_2^c)$,
$a_4 = \mu^*(\tilde{X}_1 \cap \tilde{X}_2 \cap \tilde{X}_3^c)$,
$a_5 = \mu^*(\tilde{X}_1 \cap \tilde{X}_3 \cap \tilde{X}_2^c)$, and
$a_6 = \mu^*(\tilde{X}_2 \cap \tilde{X}_3 \cap \tilde{X}_1^c)$.

Claim 3: Consider random variables $X_1$, $X_2$ and $X_3$ with
$H(X_1, X_2, X_3) = h$. Then, $b \ge -\frac{h}{2}$.

Proof: The elemental inequalities are given by $a_i \ge 0, i = 1, \dots, 6$
(non-negativity of conditional entropy and
conditional mutual information) and $a_i + b \ge 0, i = 4, 5, 6$
(non-negativity of mutual information). We also have
$(\sum_{i=1,\dots,6} a_i) + b = h$. Assume that $b < -\frac{h}{2}$. Then,

$a_i + b \ge 0 \Rightarrow a_i \ge -b > \frac{h}{2}, i = 4, 5, 6 \Rightarrow a_4 + a_5 > h.$

Next,

$h = a_1 + a_2 + a_3 + a_4 + a_5 + a_6 + b$
$\ge a_1 + a_2 + a_3 + a_4 + a_5 > a_1 + a_2 + a_3 + h.$

This implies that $a_1 + a_2 + a_3 < 0$, which is a contradiction,
since $a_i \ge 0, i = 1, \dots, 6$. ■
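
As a worked instance of Claim 3 (an illustrative example, not taken from the paper), let $X_1$ and $X_2$ be independent uniform bits and let $X_3$ be their XOR. The bound is then met with equality:

```latex
\begin{align*}
X_3 &= X_1 \oplus X_2, \qquad H(X_1, X_2, X_3) = h = 2,\\
a_1 &= a_2 = a_3 = 0, \qquad a_4 = a_5 = a_6 = 1,\\
b &= I(X_1; X_2) - I(X_1; X_2 \mid X_3) = 0 - 1 = -1 = -\frac{h}{2}.
\end{align*}
```

The atoms sum to $3 - 1 = 2 = h$, and every elemental inequality $a_i \ge 0$, $a_i + b \ge 0$ holds, the latter three with equality.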
Using this we can obtain the following lemma.
Lemma 1: Suppose that we have three source nodes. Let
the joint entropy of the original source be $h$ and let $f_{opt2}$
represent the optimal value of SUBSET-MIN-COST and $f_{opt1}$
the optimal value of CODED-MIN-COST. Let $b^*$ and $a_i^*$ be
the optimal values of $b$ and $a_i$ in the coded case, respectively.
If $b^* \ge 0$, the costs for the coded case and the subset case will
be the same. If $b^* < 0$,
$f_{opt2} - f_{opt1} \le (\min_{i \in S}(d_i)) \times |b^*| \le (\min_{i \in S}(d_i)) h/2$.

Proof: When $b^* \ge 0$, the subset case atom values are equal
to the coded case atom values, so the two cases have the
same cost. When $b^* < 0$, without loss of generality, assume
that $\min_{i \in S}(d_i) = d_1$. As in the greedy algorithm above,
we construct a feasible solution for SUBSET-MIN-COST by
keeping the flow values the same, but changing the atom values
suitably. Let $a_i^2, i = 1, \dots, 6$, and $b^2$ denote the atom values for the
subset case. Consider the following assignment,

$a_i^2 = a_i^*, i = 1, \dots, 5; \quad a_6^2 = a_6^* - |b^*|; \quad b^2 = 0.$

This is shown pictorially in Figure 5.

[Figure 5 panels: "Coded case" and "Corresponding subset case"]

Fig. 5. The figure illustrates a transformation from the coded case to the
subset case, when the first source has the minimal storage cost and $b^* < 0$.

We can check constraint (19) to see that the solution is
feasible for the gap LP for three sources. It can also be verified
that we can arrive at the above assignment by applying our
greedy algorithm. Furthermore, on checking the KKT conditions
of the gap LP, we conclude that the obtained solution
is the optimal solution for the gap LP. The flows
$x_{ij,1}^{(t)}, (i,j) \in E^*$ are feasible for the subset problem, and they do
not change in transforming the coded case to the subset case.
The only cost increase in transforming from the coded case to
the subset case is $d_1 \times |b^*| \le (\min_{i \in S}(d_i)) h/2$. ■
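
The transformation used in this proof can be written out directly. The sketch below is illustrative only: it indexes atoms by the set of sources containing them (so the pairwise atom outside the cheapest source plays the role of $a_6$), and the function names are hypothetical.

```python
def storage_cost(mu, d):
    """Sum over sources of d_i * H(X_i), with H(X_i) the sum of atoms inside X_i."""
    return sum(d[i] * v for A, v in mu.items() for i in A)

def coded_to_subset(mu1, d):
    """Three-source transformation from the proof of Lemma 1.

    Returns the subset-case atoms and the resulting increase in storage cost.
    """
    triple = frozenset(d)
    b = mu1[triple]
    mu2 = dict(mu1)
    if b >= 0:
        return mu2, 0  # atoms are already non-negative, nothing to change
    cheapest = min(d, key=d.get)
    pair = triple - {cheapest}   # pairwise atom lying outside the cheapest source
    mu2[pair] = mu1[pair] + b    # a_6^2 = a_6^* - |b^*|
    mu2[triple] = 0              # b^2 = 0
    return mu2, d[cheapest] * (-b)
```

Only the entropy of the cheapest source grows (by $|b^*|$), since the other two sources each lose $|b^*|$ from the pairwise atom and regain it from the triple atom.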

In the results section, we shall show an instance of a network 
where this upper bound is tight. 

Finally we note that, when there are only two source nodes, 
there is no cost difference between the subset case and the 
coded case, since for two random variables, all atoms have to 
be nonnegative. We state this as a lemma below. 

Lemma 2: Suppose that we have two source nodes. Let
$f_{opt2}$ represent the optimal value of SUBSET-MIN-COST
and $f_{opt1}$ the optimal value of CODED-MIN-COST. Then,

$f_{opt2} = f_{opt1}$.

VI. Simulation results 

In this section we present an example of a network with
three sources where our upper bound derived in Section V-C
is tight. We also present results of several experiments with
randomly generated graphs. The primary motivation was to 
study whether the difference in cost between the subset sources 
case and the coded case occurs very frequently. 

Consider the network in Figure 6 with three source nodes,
1, 2 and 3 and four terminal nodes, 6, 7, 8, and 9. The entropy
of the original source is $H(X_1, X_2, X_3) = 2$ and all edges are
unit-capacity. The costs are such that $f_{ij} = 1, \forall (i,j) \in E$ and
$d_1 = d_2 = 2, d_3 = 1$.

Fig. 6. Network with source nodes at 1, 2 and 3; terminals at 6, 7, 8 and 9.
Append a virtual source $S^*$ connecting real sources.



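TABLE I

Atom values when subset constraints are enforced

Atom                                                    $\mu^*(\cdot)$
$\tilde{X}_1 \cap \tilde{X}_2^c \cap \tilde{X}_3^c$     0
$\tilde{X}_1^c \cap \tilde{X}_2 \cap \tilde{X}_3^c$     0
$\tilde{X}_1 \cap \tilde{X}_2 \cap \tilde{X}_3^c$       0.5809
$\tilde{X}_1^c \cap \tilde{X}_2^c \cap \tilde{X}_3$     0
$\tilde{X}_1 \cap \tilde{X}_2^c \cap \tilde{X}_3$       0.6367
$\tilde{X}_1^c \cap \tilde{X}_2 \cap \tilde{X}_3$       0.7824
$\tilde{X}_1 \cap \tilde{X}_2 \cap \tilde{X}_3$         0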

The optimal cost in the subset sources case is 17. The
corresponding atom values are listed in Table I. In this
case we have

In the coded sources case, the optimal value is 16, with
$H(X_1) = H(X_2) = H(X_3) = 1$. Note that in this case the
gap between the optimal values is precisely $\frac{2}{2} \times 1 = 1$, i.e.,
the upper bound derived in the previous section is met with
equality.
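
A quick arithmetic check of the reported numbers is sketched below. It assumes that the three non-zero entries of Table I belong to the three pairwise atoms in the order (1,2), (1,3), (2,3) and that all other atoms are zero; that reading of the table, and the variable names, are assumptions.

```python
# Non-zero atom values from Table I, keyed by the sources that contain the atom.
atoms = {frozenset({1, 2}): 0.5809,
         frozenset({1, 3}): 0.6367,
         frozenset({2, 3}): 0.7824}
d = {1: 2, 2: 2, 3: 1}  # storage costs d_1 = d_2 = 2, d_3 = 1

h = sum(atoms.values())  # joint entropy of the original source
entropy = {i: sum(v for A, v in atoms.items() if i in A) for i in d}
storage_subset = sum(d[i] * entropy[i] for i in d)
storage_coded = sum(d[i] * 1 for i in d)  # H(X_i) = 1 for every source

h_over_2_times_min_d = (h / 2) * min(d.values())  # bound of Lemma 1
print(round(h, 4), round(storage_subset, 4), storage_coded,
      round(h_over_2_times_min_d, 4))
```

Under these assumptions the atom values sum to the joint entropy of 2, and the Lemma 1 bound evaluates to 1, matching the reported gap of 17 - 16.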

We generated several directed graphs at random with $|V| = 87$,
$|E| = 322$. The linear cost of each edge was fixed to an
integer in {1,2,3,4,5,6,29,31}. We ran 5000 experiments
with fixed parameters $(|S|, |T|, h)$, where $|S|$ is the number of
source nodes, $|T|$ the number of terminal nodes, and $h$ the entropy
of the original source. The locations of the source and terminal
nodes were chosen randomly. The capacity of each edge was
chosen at random from the set {1, 2, 3, 4, 5}. In many cases it
turned out that the network did not have enough capacity to
support recovery of the data at the terminals. These instances
were discarded.

The results are shown in Table II. The "Equal" row corresponds
to the number of instances when both the coded and
subset cases have the same cost, and "Non-equal" corresponds 
to the number of instances where the coded case has a lower 
cost. We have found that in most cases, the two cases have 
the exact same cost. We also computed the gap LP and 
the greedy algorithm to evaluate the cost gap. Note that the 
gap LP is only an upper bound since it is derived assuming 
that the flow patterns do not change between the two cases.

TABLE II

Comparisons of two schemes in 5000 random directed graphs

$(|S|,|T|,h)$   (3,3,3)   (4,4,4)   (5,5,5)   (4,5,5)   (5,4,5)   (4,4,5)
Equal           3893      2855      1609      1577      2025      1954
Non-equal       1         3         10        9         6         8

When $(|S|, |T|, h) = (4, 3, 4)$, among 5000 experiments, 3269
instances could support both cases. Out of these, there were
481 instances where the upper bound determined by the gap
LP was not tight. In addition, there were 33 instances where
the greedy algorithm failed to solve the gap LP exactly.

VII. Conclusions and Future work 

In this work, we considered network coding based content 
distribution, under the assumption that the content can be con- 
sidered as a collection of independent equal entropy sources, 
e.g., a large file that can be subdivided into small pieces. Given 
a network with a specified set of source nodes, we examined 
two cases. In the subset sources case, the source nodes are 
constrained to only contain subsets of the pieces of the content, 
whereas in the coded sources case, the source nodes can 
contain arbitrary functions of the pieces. The cost of a solution 
is defined as the sum of the storage cost and the cost of the 
flows required to support the multicast. We provided succinct 
formulations of the corresponding optimization problems by 
using the properties of information measures. In particular, 
we showed that when there are two source nodes, there is 
no loss in considering subset sources. For three source nodes, 
we derived a tight upper bound on the cost gap between the 
two cases. A greedy algorithm for estimating the cost gap 
for a given instance was provided. Finally, we also provided 
algorithms for determining the content of the source nodes. 
Our results indicate that when the number of source nodes is 
small, in many cases constraining the source nodes to only 
contain subsets of the content does not incur a loss. 

In our work, we have used linear objective functions.
However, this is not necessary. We could also have used
convex functions; the problem would then no longer admit an LP
formulation and the gap bound would be different. In our
work, we have assumed that the locations of the source nodes
are known. It would be interesting to consider whether one
can extend this work to identify the optimal locations of the
source nodes, e.g., if an ISP wants to establish mirror sites,
what their geographical locations should be. The gap between
subset and coded sources shown here is for three sources. It
would be interesting to see how it grows with the number of
sources. We conjecture that the gap can be quite large when
the number of source nodes is high. We have investigated the
difference between the coded and subset cases for networks with
arbitrary topology. Examining this issue when the network has
structural constraints (such as bounded treewidth [33]) could
be another avenue for future work.

Beyond gaps, there may be advantages to coding when
we have multi-tiered distributed storage, such as in the case
in current large CDNs. In that case, the subset approach
would require extra constraints in the middle tiers that may
be difficult to keep track of. The coded storage approach
gracefully extends to a multi-tiered architecture.

VIII. Acknowledgements 

The authors would like to thank the anonymous reviewers
whose comments greatly improved the quality and presentation
of the paper.

Appendix 

A. Proof of Theorem [7] 

(1) Independent random variables $W_A, A \in \mathcal{A}$, such that
$H(W_A) = \alpha(A)$ can be constructed [21]. Then we can set
$X_i = (W_A : A \in \mathcal{A}, A \subset \tilde{X}_i)$. It only remains to check
the consistency of the measures. For this, we have, for all
$V \subseteq \mathcal{N}_S$,

$H(X_V) = \sum_{A \in \mathcal{A}: A \subset \tilde{X}_V} H(W_A)$,    (20)

using the independence of the $W_A$'s. On the other hand we
know that

$H(X_V) = \mu^*(\tilde{X}_V) = \sum_{A \in \mathcal{A}: A \subset \tilde{X}_V} \mu^*(A)$.    (21)

Equating these two we have, for all $V \subseteq \mathcal{N}_S$,

$\sum_{A \in \mathcal{A}: A \subset \tilde{X}_V} H(W_A) = \sum_{A \in \mathcal{A}: A \subset \tilde{X}_V} \mu^*(A)$.    (22)

Now, one possible solution to this is $\mu^*(A) = H(W_A), \forall A \in \mathcal{A}$.
By the uniqueness of $\mu^*$ [21], we know
that this is the only solution.

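(2) We shall prove that all the measures are nonnegative by
induction. Without loss of generality, we can order the $X_i$'s in an
arbitrary way. We analyze the measure
$\mu^*(\tilde{X}_1 \cap \cdots \cap \tilde{X}_l \cap_{k: k \in K} \tilde{X}_k^c)$
where $K \subseteq \mathcal{N}_S \setminus \{1, 2, \dots, l\}, l \le n$.

When $l = 1$, the measure corresponds to conditional
entropy, $\forall K \subseteq \mathcal{N}_S \setminus \{1\}$,

$\mu^*(\tilde{X}_1 \cap_{k: k \in K} \tilde{X}_k^c) = H(X_1 \mid X_K) \ge 0.$

When $l = 2$, we have, $\forall K \subseteq \mathcal{N}_S \setminus \{1, 2\}$,

$\mu^*(\tilde{X}_1 \cap \tilde{X}_2 \cap_{k: k \in K} \tilde{X}_k^c) = I(X_1; X_2 \mid X_K)$
$= H(X_1, X_K) + H(X_2, X_K) - H(X_K) - H(X_1, X_2, X_K)$
$= \sum_{i \in V_1 \cap V_2 \cap_{k: k \in K} V_k^c} H(Z_i) \ge 0.$

Assume that for $l = j$, $\forall K \subseteq \mathcal{N}_S \setminus \{1, 2, \dots, j\}$, the following
statement holds,

$\mu^*(\tilde{X}_1 \cap \cdots \cap \tilde{X}_j \cap_{k: k \in K} \tilde{X}_k^c) = \sum_{i \in V_1 \cap \cdots \cap V_j \cap_{k: k \in K} V_k^c} H(Z_i).$    (23)

When $l = j + 1$, $\forall K \subseteq \mathcal{N}_S \setminus \{1, 2, \dots, j + 1\}$, we shall
have

$\mu^*(\tilde{X}_1 \cap \cdots \cap \tilde{X}_{j+1} \cap_{k: k \in K} \tilde{X}_k^c)$
$= \mu^*(\tilde{X}_1 \cap \cdots \cap \tilde{X}_j \cap_{k: k \in K} \tilde{X}_k^c) - \mu^*(\tilde{X}_1 \cap \cdots \cap \tilde{X}_j \cap \tilde{X}_{j+1}^c \cap_{k: k \in K} \tilde{X}_k^c)$
$\overset{(a)}{=} \sum_{i \in V_1 \cap \cdots \cap V_j \cap_{k: k \in K} V_k^c} H(Z_i) - \sum_{i \in V_1 \cap \cdots \cap V_j \cap V_{j+1}^c \cap_{k: k \in K} V_k^c} H(Z_i)$
$\overset{(b)}{=} \sum_{i \in V_1 \cap \cdots \cap V_{j+1} \cap_{k: k \in K} V_k^c} H(Z_i) \ge 0.$

The equation (a) is due to the assumption (23). The equation
(b) is due to the independence of the $Z_i$'s, $i \in \{1, \dots, m\}$. Therefore,
we have shown that $\forall j \le n$, $\forall K \subseteq \mathcal{N}_S \setminus \{1, 2, \dots, j\}$,

$\mu^*(\tilde{X}_1 \cap \cdots \cap \tilde{X}_j \cap_{k: k \in K} \tilde{X}_k^c) = \sum_{i \in V_1 \cap \cdots \cap V_j \cap_{k: k \in K} V_k^c} H(Z_i) \ge 0.$

In a similar manner it is easy to see that all atom measures
are non-negative.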
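
The subset-source construction in this proof can be illustrated numerically. In the sketch below, each piece $Z_k$ contributes its entropy to the single atom indexed by the set of sources that hold it, so all atoms are non-negative by construction. The function names and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations

def subset_atoms(piece_entropy, holdings):
    """piece_entropy: {k: H(Z_k)}; holdings: {i: set of pieces stored at source i}."""
    mu = {}
    for k, hk in piece_entropy.items():
        members = frozenset(i for i, pieces in holdings.items() if k in pieces)
        if members:  # pieces stored nowhere do not belong to any atom
            mu[members] = mu.get(members, 0) + hk
    return mu

def joint_entropy(piece_entropy, holdings, V):
    """H(X_V) for independent pieces: total entropy of the pieces held within V."""
    stored = set().union(*(holdings[i] for i in V))
    return sum(piece_entropy[k] for k in stored)

def consistent(piece_entropy, holdings):
    """Check H(X_V) = sum of the atoms inside the union of the X_i, i in V."""
    mu = subset_atoms(piece_entropy, holdings)
    for r in range(1, len(holdings) + 1):
        for V in combinations(holdings, r):
            inside = sum(v for A, v in mu.items() if A & set(V))
            if inside != joint_entropy(piece_entropy, holdings, V):
                return False
    return True
```

For example, with four unit-entropy pieces and holdings {1: {1, 2}, 2: {2, 3}, 3: {1, 3, 4}}, the atoms indexed by {1,3}, {1,2}, {2,3} and {3} each receive measure 1.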